Problem

Prediction

Outcome: Juvenile Survival

Possible predictors:

  • Birth date
  • Birth mass
  • Year
  • Sex
  • Maternal fecundity
  • Maternal reproductive status
  • Maternal age
  • Population size
  • Temperature (min, max, ave…)
  • Rainfall
  • Wind speed
  • …

Hierarchy of GLMs

What is multiple regression really doing?

Determine the association of each predictor with the outcome while “controlling” for the other predictors.

How?

  • Let the other variables account for variation in the predictor of interest
    • Multiple regression of the predictor of interest on the remaining predictors (response variable not involved)
  • Regress the response on the residuals of the predictor of interest

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4\]

To estimate the coefficient of \(X_1\):

  • Regress \(X_1\) on \(X_2 + X_3 + X_4\)

\[X_1 = \gamma_0 + \gamma_1 X_2 + \gamma_2 X_3 + \gamma_3 X_4\]

  • Calculate the residuals of this model.
  • Regress \(Y\) on the residuals.
  • The slope of this regression equals the estimate of \(\beta_1\) from the full model.
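
The steps above can be checked numerically. The following is a minimal sketch on simulated data; the sample size, the true coefficients, and all object names (fm_full, fm_X1, resid_X1, fm_resid) are assumptions made for illustration only.

```r
# Simulated data: every value and name here is an illustrative assumption.
set.seed(42)
n <- 200
X2 <- rnorm(n)
X3 <- rnorm(n)
X4 <- rnorm(n)
X1 <- 0.5 * X2 - 0.3 * X3 + 0.2 * X4 + rnorm(n)  # X1 is correlated with the others
Y <- 1 + 2 * X1 - 1 * X2 + 0.5 * X3 + 0.25 * X4 + rnorm(n)

# Full multiple regression
fm_full <- lm(Y ~ X1 + X2 + X3 + X4)

# Step 1: regress X1 on the remaining predictors (Y is not involved)
fm_X1 <- lm(X1 ~ X2 + X3 + X4)

# Step 2: keep the part of X1 that the other predictors do not explain
resid_X1 <- residuals(fm_X1)

# Step 3: regress Y on those residuals
fm_resid <- lm(Y ~ resid_X1)

# The two slopes should agree up to floating-point error
b1_full <- unname(coef(fm_full)["X1"])
b1_resid <- unname(coef(fm_resid)["resid_X1"])
all.equal(b1_full, b1_resid)
```

Only the slope is reproduced this way. The standard error from the residual regression differs from the full model's, because the residual regression does not remove the variation in \(Y\) explained by the other predictors.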

Residuals

Assumption: Residuals from OLS regression should be centered on zero and normally distributed.
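
One way to inspect this assumption is sketched below on simulated data. The data and the object names (fm_check, res, sw) are assumptions; the Shapiro-Wilk test is just one possible check and is not necessarily the one used in the lecture.

```r
# Simulated example: values and names are illustrative assumptions.
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

fm_check <- lm(y ~ x)
res <- residuals(fm_check)

# With an intercept in the model, OLS residuals average to (numerically) zero
mean_res <- mean(res)

# Normal Q-Q plot: points close to the line suggest approximate normality
if (interactive()) {
  qqnorm(res)
  qqline(res)
}

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(res)
sw$p.value
```

Because the mean of the residuals is zero by construction whenever an intercept is fitted, the shape of the distribution is usually the more informative thing to check.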

Milk content across mammalian species

What are the contributions of the fat and lactose content of mammalian milk to total milk energy?

  • Outcome: Kilocalories of energy per gram of milk
  • Predictors:
    • Percent fat
    • Percent lactose

library(tidyverse)  # glimpse(), select(), drop_na(), %>%
library(readxl)     # read_excel()
milk <- read_excel("../data/Milk.xlsx", na = "NA")
glimpse(milk)
## Observations: 29
## Variables: 8
## $ clade          <chr> "Strepsirrhine", "Strepsirrhine", "Strepsirrhine"…
## $ species        <chr> "Eulemur fulvus", "E macaco", "E mongoz", "E rubr…
## $ kcal.per.g     <dbl> 0.49, 0.51, 0.46, 0.48, 0.60, 0.47, 0.56, 0.89, 0…
## $ perc.fat       <dbl> 16.60, 19.27, 14.11, 14.91, 27.28, 21.22, 29.66, …
## $ perc.protein   <dbl> 15.42, 16.91, 16.85, 13.18, 19.50, 23.58, 23.46, …
## $ perc.lactose   <dbl> 67.98, 63.82, 69.04, 71.91, 53.22, 55.20, 46.88, …
## $ mass           <dbl> 1.95, 2.09, 2.51, 1.62, 2.19, 5.25, 5.37, 2.51, 0…
## $ neocortex.perc <dbl> 55.16, NA, NA, NA, NA, 64.54, 64.54, 67.64, NA, 6…

Ignore for now that these are comparative species-level data.

Wrangling data

Keep:

  • species: Species
  • kcal.per.g: Kilocalories of energy per gram of milk
  • perc.fat: Percent fat
  • perc.lactose: Percent lactose

Filter complete cases (drop rows with NA).

M <- milk %>% select(species, kcal.per.g, perc.fat, perc.lactose) %>%
  drop_na()
names(M) <- c("Species", "Milk_Energy", "Fat", "Lactose")
glimpse(M)
## Observations: 29
## Variables: 4
## $ Species     <chr> "Eulemur fulvus", "E macaco", "E mongoz", "E rubrive…
## $ Milk_Energy <dbl> 0.49, 0.51, 0.46, 0.48, 0.60, 0.47, 0.56, 0.89, 0.91…
## $ Fat         <dbl> 16.60, 19.27, 14.11, 14.91, 27.28, 21.22, 29.66, 53.…
## $ Lactose     <dbl> 67.98, 63.82, 69.04, 71.91, 53.22, 55.20, 46.88, 30.…

Visualizing data

library(GGally)
ggscatmat(as.data.frame(M), columns = 2:4)

Multiple regression

fm <- lm(Milk_Energy ~ Fat + Lactose, data = M)
summary(fm)
## 
## Call:
## lm(formula = Milk_Energy ~ Fat + Lactose, data = M)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.11350 -0.05047  0.01103  0.04649  0.12701 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.007372   0.211168   4.770 6.16e-05 ***
## Fat          0.001952   0.002533   0.771  0.44784    
## Lactose     -0.008709   0.002575  -3.382  0.00229 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06447 on 26 degrees of freedom
## Multiple R-squared:  0.8518, Adjusted R-squared:  0.8404 
## F-statistic: 74.74 on 2 and 26 DF,  p-value: 1.657e-11
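
To read the coefficients, it can help to compute one fitted value by hand. This sketch plugs the estimates printed above into the model equation for the first species in the data (Eulemur fulvus: Fat = 16.60, Lactose = 67.98, Milk_Energy = 0.49). The object names are assumptions.

```r
# Coefficients copied from the summary(fm) output
b0 <- 1.007372
b_fat <- 0.001952
b_lactose <- -0.008709

# Fitted value for Eulemur fulvus
predicted <- b0 + b_fat * 16.60 + b_lactose * 67.98

# Residual = observed - fitted
residual <- 0.49 - predicted

round(c(predicted = predicted, residual = residual), 3)
```

The fitted value is about 0.448 kcal/g and the residual about 0.042, which sits inside the residual range reported in the summary. Holding Fat fixed, each additional percentage point of Lactose lowers the predicted energy by about 0.0087 kcal/g.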

Visualizing multiple regression

Estimate Fat coefficient

  1. Regress Fat on Lactose. The residuals are the variation in Fat that Lactose does not explain.
  2. Extract the residuals and add them to the data.

fm_Lact <- lm(Fat ~ Lactose, data = M)
M$resid_Lact <- residuals(fm_Lact)
head(M)
## # A tibble: 6 x 5
##   Species            Milk_Energy   Fat Lactose resid_Lact
##   <chr>                    <dbl> <dbl>   <dbl>      <dbl>
## 1 Eulemur fulvus            0.49  16.6    68.0      0.196
## 2 E macaco                  0.51  19.3    63.8     -1.12 
## 3 E mongoz                  0.46  14.1    69.0     -1.28 
## 4 E rubriventer             0.48  14.9    71.9      2.27 
## 5 Lemur catta               0.6   27.3    53.2     -3.25 
## 6 Alouatta seniculus        0.47  21.2    55.2     -7.42

  3. Regress Milk_Energy on the residuals and compare the slope with the Fat coefficient from the full model.

coef(lm(Milk_Energy ~ resid_Lact, data = M))
## (Intercept)  resid_Lact 
## 0.641724138 0.001952441
coef(fm)
##  (Intercept)          Fat      Lactose 
##  1.007371840  0.001952441 -0.008708827

The intercepts differ because the residuals have mean zero, so the intercept of the residual regression is simply the mean of Milk_Energy.

Estimate Lactose coefficient

  1. Regress Lactose on Fat. The residuals are the variation in Lactose that Fat does not explain.
  2. Extract the residuals and add them to the data (M).

fm_Fat <- lm(Lactose ~ Fat, data = M)
M$resid_Fat <- residuals(fm_Fat)

  3. Regress Milk_Energy on the residuals and compare the slope with the Lactose coefficient from the full model.

coef(lm(Milk_Energy ~ resid_Fat, data = M))
##  (Intercept)    resid_Fat 
##  0.641724138 -0.008708827
coef(fm)
##  (Intercept)          Fat      Lactose 
##  1.007371840  0.001952441 -0.008708827

Compare

Multicollinearity

High correlation between predictors leaves little residual variation to be used for explaining the outcome variable.
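
A minimal sketch of this effect on simulated data follows. All names and values are assumptions; the variance inflation factor, \(1 / (1 - R^2)\), is a standard diagnostic but is not named in the slides.

```r
# Simulated data: values and names are illustrative assumptions.
set.seed(7)
n <- 100
x1 <- rnorm(n)
x2_weak <- 0.1 * x1 + rnorm(n)         # weakly correlated with x1
x2_strong <- x1 + rnorm(n, sd = 0.1)   # almost a copy of x1

# Same true coefficients in both scenarios
y_weak <- 1 + x1 + x2_weak + rnorm(n)
y_strong <- 1 + x1 + x2_strong + rnorm(n)

# Residual variation left in x1 after accounting for the other predictor
var_weak <- var(residuals(lm(x1 ~ x2_weak)))
var_strong <- var(residuals(lm(x1 ~ x2_strong)))

# Standard error of the x1 coefficient in each multiple regression
se_weak <- summary(lm(y_weak ~ x1 + x2_weak))$coefficients["x1", "Std. Error"]
se_strong <- summary(lm(y_strong ~ x1 + x2_strong))$coefficients["x1", "Std. Error"]

# Variance inflation factor for x1 in the strongly correlated scenario
vif_strong <- 1 / (1 - summary(lm(x1 ~ x2_strong))$r.squared)

c(var_weak = var_weak, var_strong = var_strong,
  se_weak = se_weak, se_strong = se_strong, vif_strong = vif_strong)
```

With the strongly correlated pair, far less residual variation remains in x1, and the standard error of its coefficient is correspondingly larger.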

Masking

Multiple predictors are useful for predicting outcomes when bivariate relationships with the response variable are not strong.

But:

  • Associative relationships can be obscured when two predictors are somewhat correlated with one another.
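
Masking can be reproduced with a small simulation. In this sketch, two positively correlated predictors have opposite effects on the outcome, so each one looks unimportant on its own. The names (pos, neg) and all numeric values are assumptions chosen for illustration.

```r
# Simulated data: values and names are illustrative assumptions.
set.seed(11)
n <- 200
pos <- rnorm(n)                         # true effect on y is +1
neg <- 0.9 * pos + rnorm(n, sd = 0.4)   # correlated with pos; true effect is -1
y <- pos - neg + rnorm(n, sd = 0.5)

# Bivariate fits: each slope is pulled toward zero by the omitted predictor
b_pos_alone <- unname(coef(lm(y ~ pos))["pos"])
b_neg_alone <- unname(coef(lm(y ~ neg))["neg"])

# Multiple regression: both effects are recovered
b_both <- coef(lm(y ~ pos + neg))

round(c(pos_alone = b_pos_alone, neg_alone = b_neg_alone,
        pos_both = unname(b_both["pos"]), neg_both = unname(b_both["neg"])), 2)
```

The bivariate slopes should be close to zero, while the slopes from the model with both predictors should be close to +1 and -1.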

Mammal milk data in a different context

Milk is a big energetic investment

  • Is the energy content of milk significantly associated with neocortex size while controlling for body size?
  • Do primates with larger brains produce significantly more nutritious milk so that their offspring can grow quickly (because they must grow quickly)?
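
The models in this part use the columns Neocortex and log_Mass, so the data must be rebuilt from milk first. That step is not shown. The sketch below is one plausible version and rests on assumptions: that Neocortex is neocortex.perc, that log_Mass is the natural log of mass, and that rows with missing values are dropped. It is written in base R and uses only the first seven rows visible in the glimpse(milk) output; the names milk_head and M_sketch are hypothetical.

```r
# First seven rows of three columns, copied from the glimpse(milk) output
milk_head <- data.frame(
  kcal.per.g = c(0.49, 0.51, 0.46, 0.48, 0.60, 0.47, 0.56),
  mass = c(1.95, 2.09, 2.51, 1.62, 2.19, 5.25, 5.37),
  neocortex.perc = c(55.16, NA, NA, NA, NA, 64.54, 64.54)
)

# Assumed wrangling: rename columns and log-transform mass (natural log)
M_sketch <- data.frame(
  Milk_Energy = milk_head$kcal.per.g,
  Neocortex = milk_head$neocortex.perc,
  log_Mass = log(milk_head$mass)
)

# Keep complete cases only (drop rows with NA)
M_sketch <- M_sketch[complete.cases(M_sketch), ]
nrow(M_sketch)
```

Three of these seven rows survive. In the full data, the 15 residual degrees of freedom reported for the bivariate fits imply that 17 species have complete records.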

Visualizing

Bivariate model of Neocortex

fm_Neo <- lm(Milk_Energy ~ Neocortex, data = M)
summary(fm_Neo)
## 
## Call:
## lm(formula = Milk_Energy ~ Neocortex, data = M)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.19027 -0.14693 -0.03744  0.15613  0.29959 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.353332   0.501120   0.705    0.492
## Neocortex   0.004503   0.007389   0.609    0.551
## 
## Residual standard error: 0.1764 on 15 degrees of freedom
## Multiple R-squared:  0.02417,    Adjusted R-squared:  -0.04089 
## F-statistic: 0.3715 on 1 and 15 DF,  p-value: 0.5513

Bivariate model of log Mass

fm_Mass <- lm(Milk_Energy ~ log_Mass, data = M)
summary(fm_Mass)
## 
## Call:
## lm(formula = Milk_Energy ~ log_Mass, data = M)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26908 -0.09190 -0.03189  0.13180  0.30209 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.70516    0.05185  13.599 7.68e-10 ***
## log_Mass    -0.03169    0.02160  -1.467    0.163    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.167 on 15 degrees of freedom
## Multiple R-squared:  0.1255, Adjusted R-squared:  0.0672 
## F-statistic: 2.153 on 1 and 15 DF,  p-value: 0.163

Multiple regression model

fm_Multi <- lm(Milk_Energy ~ Neocortex + log_Mass, data = M)
summary(fm_Multi)
## 
## Call:
## lm(formula = Milk_Energy ~ Neocortex + log_Mass, data = M)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.250574 -0.039212  0.000633  0.072997  0.201985 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.085254   0.515281  -2.106  0.05372 . 
## Neocortex    0.027931   0.008015   3.485  0.00364 **
## log_Mass    -0.096402   0.024749  -3.895  0.00162 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1265 on 14 degrees of freedom
## Multiple R-squared:  0.5317, Adjusted R-squared:  0.4648 
## F-statistic: 7.948 on 2 and 14 DF,  p-value: 0.004939

Interpretation

  • Both coefficients increase in magnitude
    • Neocortex: \(0.005 \rightarrow 0.03\) (P = 0.004)
    • log Mass: \(-0.03 \rightarrow -0.1\) (P = 0.002)

Regression asks (and answers):

  1. Do species that have high neocortex percentage for their mass have higher energy milk?
  2. Do species with high body mass for their neocortex percentage have higher energy milk?

Neocortex vs. log Mass

Milk Energy vs. Residual Mass

Quiz 07-3

Complete Quiz 07-3

Watch Lecture 07-4